HEAD
Prompt: [3]
ChatGPT/AI disclosure statement: [describe whether and how you used ChatGPT or other AI tools for this assignment. If you did not, please write “I did not use ChatGPT or other AI for this assignment.”]
At the 2024 Olympics, a team known as “Individual Neutral Athletes” was established, allowing athletes from countries under international sanctions, such as Russia and Belarus, to compete fairly. However, the impact of geopolitical tensions on their training effectiveness and final performance remains uncertain. This event has highlighted the intricate relationship between national politics and international sports competitions. I have been employed by the International Olympic Committee (IOC) to compile new data and examine how the quality of national governance affects Olympic performance. The findings from this research will provide the IOC with insights necessary to make strategic decisions, especially concerning the reallocation of resources to enhance fairness and efficiency in future Olympic Games, thus ensuring the Olympics remain a neutral platform where sportsmanship triumphs over political strif
Prompt: [2]
ChatGPT/AI disclosure statement: [When I use RSelenium, I cannot find the content outside the drop-down list. I asked chatGPT to solve it. He told me that I need to use JavaScript(row 245-258)]
In 2024, the Olympic Games celebrated gender parity in athlete participation for the first time. This prompts vital questions: How has women’s participation evolved, and does their performance match that of men? Furthermore, how do these developments relate to gender equality?
As a PhD student in social sciences, my research focuses on data on the participation and performance of female athletes from different region throughout Olympic history, alongside statistics of global gender equality. This study seeks to understand the relationship between advancements in gender parity in sports and overall gender equality worldwide. The insights derived from this research will provide valuable perspectives for fostering more inclusive sporting environments globally. Before start, it’s a good idea to look at the following code: ### 1.1: Install nesssary packages
# You may need to run these codes because your environment may not have these packages
if (!require(knitr)) install.packages("knitr")
if (!require(rvest)) install.packages("rvest")
if (!require(dplyr)) install.packages("dplyr")
if (!require(tidyr)) install.packages("tidyr")
if (!require(purrr)) install.packages("purrr")
if (!require(tibble)) install.packages("tibble")
if (!require(readr)) install.packages("readr")
if (!require(stringr)) install.packages("stringr")
if (!require(httr)) install.packages("httr")
if (!require(jsonlite)) install.packages("jsonlite")
if (!require(keyring)) install.packages("keyring")
if (!require(RSelenium)) install.packages("RSelenium")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(grid)) install.packages("grid")
if (!require(fs)) install.packages("fs")
if (!require(png)) install.packages("png")
if (!require(DBI)) install.packages("DBI")
if (!require(RSQLite)) install.packages("RSQLite")
if (!require(plotly)) install.packages("plotly")
# Run the following to set up the directory to ensure you have the correct path when starting from any code block
rowdata_ddir <- "rowdata"
if (!dir.exists(rowdata_ddir)) dir.create(rowdata_ddir)
output_ddir <- "output"
if (!dir.exists(output_ddir)) dir.create(output_ddir)
tidydata_ddir <- "output/tidydata"
if (!dir.exists(tidydata_ddir)) dir.create(tidydata_ddir)
readydata_ddir <- "output/readydata"
if (!dir.exists(readydata_ddir)) dir.create(readydata_ddir)
database_ddir <- "output/database"
if (!dir.exists(database_ddir)) dir.create(database_ddir)
vis_ddir <- "output/vis"
if (!dir.exists(vis_ddir)) dir.create(vis_ddir)
# If you need to re-scraping when using Rselenium, you will encounter port occupancy issues. In this case, run this code and replace the four numbers in the original "port" with the result.
random_port_number <- sample(1000:9999, 1)
print(random_port_number)
## [1] 1532
This study’s primary data includes historical Olympic medal
information from Wikipedia and detailed participant
statistics for each games from Olympedia. Medal tables were
extracted via xpath and CSS. I also used the
API from Kaggle(also
use Olympedia as source) to get detailed data on athletes prior to 2016
and Rselenium to fill in the missing parts.See License.
This is quicker, but my code still allows you to get first-hand data
using Rselenium.The data are primary
because they not only help to analyze the gender disparities and
participation trends in different countries over time, but also to lay
the foundation for visualization. Our scraping script is designed with
good scraping etiquette. Plus, I’ve stored the API key in advance, so
you don’t need to know it to use the code. Due to political disputes,
participating regions cannot be mapped one by one to
the countries. I will rename them as “region” in the following sections.
Do not use for visual mapping.
library(rvest)
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(readr)
<<<<<<< HEAD
# STEP 1: Scrape all Wikipedia pages of Summer Olympics medal table
# Manually specify Olympic years, considering historical three interruptions cause by World War
# P.S. Although the Tokyo Olympics was actually be held in 2021, the website shows it as 2020
olympic_years <- c(seq(1896, 1912, by = 4), 1920, seq(1924, 1936, by = 4), 1948, seq(1952, 2016, by = 4), 2020, 2024)
# Base URL for Wikipedia Olympic medal tables
base_url <- "https://en.wikipedia.org/wiki/"
# List to store HTML content of each Olympic medal table page
=======
library(stringr)
rowdata_ddir <- "rowdata"
if (!dir.exists(rowdata_ddir)) dir.create(rowdata_ddir)
# STEP 1: Scrape all Wikipedia pages of Summer Olympics medal table
# Manually specify Olympic years, considering historical three interruptions cause by World War
# P.S. Although the Tokyo Olympics was actually be held in 2021, the website shows it as 2020
olympic_years <- c(seq(1896, 1912, by = 4), 1920, seq(1924, 1936, by = 4), 1948, seq(1952, 2016, by = 4), 2020, 2024)
base_url <- "https://en.wikipedia.org/wiki/"
>>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03
olympic_pages <- list()
# Scrape pages for all specified Olympic years
for (year in olympic_years) {
url_suffix <- sprintf("%d_Summer_Olympics_medal_table#Medal_table", year)
olympic_url <- paste0(base_url, url_suffix)
<<<<<<< HEAD
olympic_pages[[as.character(year)]] <- read_html(olympic_url)# Fetch and store HTML content
Sys.sleep(2)# Pause to respect the server's load
}
# STEP 2: Extract medal tables and host coutry data using XPath
# Define XPaths based on pages of different years
special_years <- c(1936, 1948, 1960, 1964, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020, 2024)
xpath_years_default <- '//*[@id="mw-content-text"]/div[1]/table[2]/tbody'
xpath_years_special <- '//*[@id="mw-content-text"]/div[1]/table[3]/tbody'
# To store extracted tables
medal_tables <- list()
# Locate tables with the specified structures
for (year in names(olympic_pages)) {
page <- olympic_pages[[year]]
# Determine the correct XPath based on the year
xpath <- if (as.numeric(year) %in% special_years) xpath_years_special else xpath_years_default
table_elements <- page %>% html_elements(xpath = xpath)
# Output message if fail for further progress
if (length(table_elements) > 0 && !any(inherits(table_elements, "xml_missing"))) {
medal_tables[[year]] <- html_table(table_elements[[1]], fill = TRUE)
} else {
message(sprintf("No medal table found for year %s, skipping...", year))
}
Sys.sleep(2)# Pause to respect the server's load
}
# Prepare a list to store host countries
host_countries_data <- list()
# Scrape host country data using a uniform CSS selector
for (year in names(olympic_pages)) {
page <- olympic_pages[[year]]
# Check for special conditions
if (as.numeric(year) == 1956) {
host_countries_data[[year]] <- "Australia" # Manually set for 1956,which has different css
} else if (as.numeric(year) == 2024) {
host_countries_data[[year]] <- "France" # Manually set for 2024, which has different css
} else {
# Fetch host country using a uniform CSS selector and store it
host_country_node <- page %>% html_node(".infobox-data.location a")
host_country <- if (!is.null(host_country_node)) html_text(host_country_node, trim = TRUE) else "Unknown"
host_countries_data[[year]] <- host_country
}
Sys.sleep(1) # Pause to respect the server's load
}
# Step 3: Process and clean extracted medal_table data
=======
olympic_pages[[as.character(year)]] <- read_html(olympic_url)
Sys.sleep(1)
}
# STEP 2: Extract medal tables using XPath
special_years <- c(1936, 1948, 1960, 1964, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020, 2024)
xpath_years_default <- '//*[@id="mw-content-text"]/div[1]/table[2]/tbody'
xpath_years_special <- '//*[@id="mw-content-text"]/div[1]/table[3]/tbody'
medal_tables <- list()
for (year in names(olympic_pages)) {
page <- olympic_pages[[year]]
xpath <- if (as.numeric(year) %in% special_years) xpath_years_special else xpath_years_default
table_elements <- page %>% html_elements(xpath = xpath)
if (length(table_elements) > 0 && !any(inherits(table_elements, "xml_missing"))) {
table <- html_table(table_elements[[1]], fill = TRUE)
# Specific year check for manual correction
if (year == "1908") {
# Correct "Great Britain" name issue by removing unwanted text
table$Nation <- gsub("Great Britain.*", "Great Britain", table$Nation)
}
medal_tables[[year]] <- table
} else {
message(sprintf("No medal table found for year %s, skipping...", year))
}
Sys.sleep(1)
}
# STEP 3: Extract host countries
host_url <- "https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities"
host_country <- read_html(host_url) %>%
html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[1]') %>%
html_table(fill = TRUE) %>%
select(Country, Year) %>%
mutate(
Year = as.numeric(Year),
Country = str_remove(Country, "\\[.*?\\]")
) %>%
filter(Year %in% olympic_years)
host_countries <- host_country %>% deframe()
# STEP 4: Process and clean extracted medal_table data
>>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03
# Initialize an empty tibble for storing results
olympic_medal_data <- tibble(
olympic_year = integer(),
olympic_rank = integer(),
country_participate = character(),
medal_gold = integer(),
medal_silver = integer(),
medal_bronze = integer(),
medal_total = integer(),
country_host = integer())
# Process each page's medal table and integrate host country data
for (year in names(medal_tables)) {
table <- medal_tables[[year]]
<<<<<<< HEAD
# Remove the last row which usually contains totals
if (nrow(table) > 0) {
table <- table[-nrow(table), ]
}
# Check and find the table has 'NOC' or 'Nation' as the column name and standardize it
if ("NOC" %in% names(table)) {
country_col_name <- "NOC"
} else if ("Nation" %in% names(table)) {
country_col_name <- "Nation"
}
# Transform the HTML table into a cleaned and structured tibble
medal_data <- table %>%
rename(country_participate = {{country_col_name}}) %>%
mutate(
olympic_year = as.integer(year),
olympic_rank = as.integer(Rank),
country_participate = gsub("\\[.*?\\]|\\*|‡|\\(.*?\\)", "", country_participate), # Remove annotations, superscripts, and content in parentheses
medal_gold = as.integer(Gold),
medal_silver = as.integer(Silver),
medal_bronze = as.integer(Bronze),
medal_total = as.integer(Total),
country_host = ifelse(country_participate == host_countries_data[[year]], 1, 0)
) %>%
select(olympic_year, olympic_rank, country_participate, medal_gold, medal_silver, medal_bronze, medal_total, country_host)
# Append this year's data to the main Olympic medal dataset
olympic_medal_data <- bind_rows(olympic_medal_data, medal_data)
}
# Print the final structured data for verification
head(olympic_medal_data)
## # A tibble: 6 × 8
## olympic_year olympic_rank country_participate medal_gold medal_silver
## <int> <int> <chr> <int> <int>
=======
# Remove the last row which contains totals
table <- table[-nrow(table), ]
table <- table %>%
rename_with(~ if_else(.x %in% c("NOC", "Nation"), "country_participate", .x)) %>%
filter(country_participate != "Mixed team") # Remove non-nation variables
# Clean and convert data types
medal_data <- table %>%
mutate(
medal_gold = as.numeric(Gold),
medal_silver = as.numeric(Silver),
medal_bronze = as.numeric(Bronze),
medal_total = as.numeric(Total),
olympic_year = as.integer(year),
olympic_rank = as.integer(Rank),
country_participate = gsub("\\[.*?\\]|\\*|‡|\\u00a0", "", country_participate) %>% trimws(),
country_participate = if_else(
olympic_year %in% c(1932, 1960),
gsub("[^A-Za-z ]", "", country_participate) %>%
gsub("[A-Z]{3}$", "", .) %>%
trimws(),
country_participate ),
country_host = ifelse(
country_participate == host_country %>% filter(Year == as.numeric(year)) %>% pull(Country), 1, 0)) %>%
select( olympic_year, olympic_rank, country_participate,
medal_gold, medal_silver, medal_bronze, medal_total, country_host)
olympic_medal_data <- bind_rows(olympic_medal_data, medal_data)
}
write_csv(olympic_medal_data, file.path(rowdata_ddir, "olympic_medal_data.csv"))
print(head(olympic_medal_data))
## # A tibble: 6 × 8
## olympic_year olympic_rank country_participate medal_gold medal_silver
## <int> <int> <chr> <dbl> <dbl>
>>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03
## 1 1896 1 United States 11 7
## 2 1896 2 Greece 10 18
## 3 1896 3 Germany 6 5
## 4 1896 4 France 5 4
## 5 1896 5 Great Britain 2 3
## 6 1896 6 Hungary 2 1
<<<<<<< HEAD
## # ℹ 3 more variables: medal_bronze <int>, medal_total <int>, country_host <dbl>
write_csv(olympic_medal_data, "olympic_medal_data.csv") # Save results as a CSV file
The primary data for this study comprises historical Olympic medal information sourced from publicly accessible Wikipedia pages for each Summer Olympic event from 1896(earliest) to 2024(latest). The choice of Wikipedia is due to its comprehensive and publicly reviewed datasets. This data was programmatically retrieved using R with the rvest package.Through Xpath and CSS, we extract and parse HTML pages to retrieve medal tables and country details.This data is primary because it not only serves as a foundation for assessing national performances across Olympics but also supports integrated analysis with political datasets and underpins the creation of visual maps. One limitation is the inherent untidiness of the scraped data, requiring futher preprocessing in sections 5 and 6 below. Moreover, the XPath used for scraping may vary, especially with future Olympic pages, necessitating periodic script adjustments to maintain data accuracy. To address potential ethical concerns and align with good scraping etiquette, our data collection process includes measures such as adhering to Wikipedia’s robots.txt, implementing Sys.sleep() to moderate our request rate, and use list to store the result of running read_html() only once, reducing the number of requests. Please follow these parts.
[The text and code for this section goes here.]
[The text and code for this section goes here.]
[The text and code for this section goes here.]
[The text and code for this section goes here.]
======= ## # ℹ 3 more variables: medal_bronze <dbl>, medal_total <dbl>, country_host <dbl>This is the historical medal table obtained from wiki.
library(httr)
library(jsonlite)
library(tidyverse)
library(keyring)
library(tibble)
library(RSelenium)
# NOTE: Basic bio data on athletes and medal results from Athens 1896 to Rio 2016 have been published on the [Kaggle](https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results) , and it has been generously allowed to be used for free via API. Direct use not only reduces the workload, but also reduces the pressure on crawling the server, so I will only crawl data for 2020.But my steps can be used for all years
# Retrieve the API key from keyring
# For the purpose of password protection, I have already set the API key, so you don't have to do step in the next row
# keyring::key_set(service = "Kaggle", username = "olympic-api-key")
api_key <- keyring::key_get(service = "Kaggle", username = "olympic-api-key")
# Download zip file
url <- "https://www.kaggle.com/api/v1/datasets/download/heesoo37/120-years-of-olympic-history-athletes-and-results"
output_file <- "rowdata/120-years-of-olympic-history-athletes-and-results.zip"
response <- GET(url, add_headers(Authorization = paste("Bearer", api_key)), write_disk(output_file, overwrite = TRUE))
# Unzip the file
zip_path <- "rowdata/120-years-of-olympic-history-athletes-and-results.zip"
extraction_path <- "rowdata/120-years-of-olympic-history-athletes-and-results"
unzip(zipfile = zip_path, exdir = extraction_path)
extracted_files <- list.files(extraction_path, full.names = TRUE)
# Load the file
olympic_athletes_rowdata <- read_csv(extracted_files[1])
NOC_rowdata <- read_csv(extracted_files[2])
# Convert the DataFrame to a tibble
olympic_athletes_rowdata <- as_tibble(olympic_athletes_rowdata)
NOC_rowdata <- as_tibble(NOC_rowdata)
# Merge two file and filter by Olympic years (summer)
mapping <- NOC_rowdata %>%
select(NOC, region) %>%
distinct()
olympic_athletes_data <- olympic_athletes_rowdata %>%
left_join(mapping, by = "NOC") %>%
mutate(Team = ifelse(is.na(region), Team, region)) %>%
filter(Year %in% olympic_years) %>%
select(Name, Sex, Team, Year, Sport, Event, Medal)
# Filter by female
olympic_female_athletes_data <- olympic_athletes_data %>%
filter(Sex == "F")%>%
select(-Sex)
print(head(olympic_female_athletes_data))
## # A tibble: 6 × 6
## Name Team Year Sport Event Medal
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 "Christine Jacoba Aaftink" Netherlands 1988 Speed Sk… Spee… <NA>
## 2 "Christine Jacoba Aaftink" Netherlands 1988 Speed Sk… Spee… <NA>
## 3 "Christine Jacoba Aaftink" Netherlands 1992 Speed Sk… Spee… <NA>
## 4 "Christine Jacoba Aaftink" Netherlands 1992 Speed Sk… Spee… <NA>
## 5 "Cornelia \"Cor\" Aalten (-Strannood)" Netherlands 1932 Athletics Athl… <NA>
## 6 "Cornelia \"Cor\" Aalten (-Strannood)" Netherlands 1932 Athletics Athl… <NA>
This table shows information about female Olympic competitors throughout history
# Start RSelenium Driver for Firefox
driver <- rsDriver(browser = "firefox", port = 4445L, verbose = FALSE)
remote_driver <- driver$client
# Navigate to the Olympedia website
url <- "https://www.olympedia.org/statistics/medal/athlete"
remote_driver$navigate(url)
Sys.sleep(3)
# Define all valid year positions in the dropdown
year_positions <- c(1:3, 5:26, 53, 54, 59, 61)
# Initialize a variable to store only 2020 data
medal_data_2020 <- NULL
# Loop through all valid year positions
for (pos in year_positions) {
# Use JavaScript to directly select the year based on position
tryCatch({
### * the following lines of code were generated by AI/ChatGPT
remote_driver$executeScript(sprintf("
var dropdown = document.getElementById('edition_select');
dropdown.value = %d; // Select the option with the given value
var event = new Event('change'); // Trigger a change event
dropdown.dispatchEvent(event);
", pos))
Sys.sleep(2)
}, error = function(e) {
next
})
# Select "Female" in the gender dropdown
tryCatch({
gender_dropdown <- remote_driver$findElement(using = 'xpath', value = '//*[@id="gender"]')
gender_dropdown$clickElement()
Sys.sleep(0.5)
female_option <- remote_driver$findElement(using = 'xpath', value = '//*[@id="gender"]/option[3]')
female_option$clickElement()
Sys.sleep(2)
}, error = function(e) {
next
})
# Extract the page source and parse the medal table
tryCatch({
page_source <- remote_driver$getPageSource()[[1]]
page_html <- read_html(page_source)
medal_table <- page_html %>%
html_table(fill = TRUE) %>%
.[[1]]
year_text <- sprintf("Position %d", pos)
medal_table <- medal_table %>% mutate(Year_Position = year_text)
# If it's 2020 (position 61), store the data
if (pos == 61) {
medal_data_2020 <- medal_table
}
}, error = function(e) {
next
})
}
# Save only the 2020 medal data to a CSV file
if (!is.null(medal_data_2020)) {
write_csv(medal_data_2020, file.path(rowdata_ddir,"Olympic_Female_Medals_2020.csv"))
}
print(head(medal_data_2020))
## # A tibble: 6 × 7
## Athlete Nat Gold Silver Bronze Total Year_Position
## <chr> <chr> <int> <int> <int> <int> <chr>
## 1 Emma McKeon AUS 4 0 2 6 Position 61
## 2 Lisa Carrington NZL 3 0 0 3 Position 61
## 3 Elaine Thompson-Herah JAM 3 0 0 3 Position 61
## 4 Kaylee McKeown AUS 3 0 0 3 Position 61
## 5 Katie Ledecky USA 2 2 0 4 Position 61
## 6 Ariarne Titmus AUS 2 1 1 4 Position 61
# Close the RSelenium session
remote_driver$close()
driver$server$stop()
## [1] TRUE
This table shows the women who won the 2020 award
# Because I've already show you about how to get all the years with RSelenium, I won't repeat it again here, just grab 2020 directly.
url2020 <- "https://www.olympedia.org/counts/edition/61"
html_content <- read_html(url2020)
tab <- html_table(html_content, fill = TRUE)
olympic_participators2020 <- tab[[1]]
write_csv(olympic_participators2020, file.path(rowdata_ddir,"olympic_participators2020.csv"))
print(head(olympic_participators2020))
## # A tibble: 6 × 102
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "" ARC ARC ATH ATH BBL BBL BDM BDM BK3 BK3 BKB BKB
## 2 "" m w m w m w m w m w m w
## 3 "AFG" - - 1 1 - - - - - - - -
## 4 "ALB" - - 1 1 - - - - - - - -
## 5 "ALG" - - 4 1 - - - - - - - -
## 6 "AND" - - 1 - - - - - - - - -
## # ℹ 89 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>, X18 <chr>,
## # X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>, X23 <chr>, X24 <chr>,
## # X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>,
## # X31 <chr>, X32 <chr>, X33 <chr>, X34 <chr>, X35 <chr>, X36 <chr>,
## # X37 <chr>, X38 <chr>, X39 <chr>, X40 <chr>, X41 <chr>, X42 <chr>,
## # X43 <chr>, X44 <chr>, X45 <chr>, X46 <chr>, X47 <chr>, X48 <chr>,
## # X49 <chr>, X50 <chr>, X51 <chr>, X52 <chr>, X53 <chr>, X54 <chr>, …
This table shows the 2020 athletes of all genders
Using RSelenium, I automatically download the JSON of Gender Inequality Index (GII) from UNDP.I also downloaded and plotted a PNG of the GII definition to illustrate what it complements. This secondary data enhances our understanding of the socio-economic factors influencing Olympic sports.I envision that performing multiple linear regression analysis using various dimensions from the GII along with the primary data will yield interesting results.
library(readxl)
library(png)
library(ggplot2)
library(grid)
library(fs)
custom_download_dir <- "rowdata"
# Set Firefox preferences
firefox_profile <- list(
"browser.download.folderList" = 2,
"browser.download.dir" = custom_download_dir,
"browser.helperApps.neverAsk.saveToDisk" = "application/json"
)
# Start Selenium driver with Firefox
driver <- rsDriver(
browser = "firefox",
port = 5482L,
verbose = FALSE,
extraCapabilities = list(
"moz:firefoxOptions" = list(
prefs = firefox_profile
)
)
)
remote_driver <- driver$client
# Navigate to the target URL
gii_url <- "https://hdr.undp.org/data-center/documentation-and-downloads"
remote_driver$navigate(gii_url)
Sys.sleep(3)
# Simulate four click and select process
filter_by_index_xpath <- '//*[@id="rc_select_0"]'
filter_by_index_input <- remote_driver$findElement(using = "xpath", value = filter_by_index_xpath)
filter_by_index_input$clickElement() # Click to activate the dropdown
Sys.sleep(1)
filter_by_index_input$sendKeysToElement(list("GII")) # Type "GII"
Sys.sleep(1)
filter_by_index_input$sendKeysToElement(list(key = "enter")) # Press Enter to confirm selection
Sys.sleep(1)
filter_by_indicator_xpath <- '//*[@id="rc_select_1"]'
filter_by_indicator_input <- remote_driver$findElement(using = "xpath", value = filter_by_indicator_xpath)
filter_by_indicator_input$clickElement() # Click to activate the dropdown
Sys.sleep(1)
filter_by_indicator_input$sendKeysToElement(list("Gender")) # Type "Gender" to choose Gender Inequality Index
Sys.sleep(1)
filter_by_indicator_input$sendKeysToElement(list(key = "enter")) # Press Enter to confirm selection
Sys.sleep(1)
filter_by_year_xpath <- '//*[@id="rc_select_2"]'
filter_by_year_input <- remote_driver$findElement(using = "xpath", value = filter_by_year_xpath)
filter_by_year_input$clickElement() # Click to activate the dropdown
Sys.sleep(1)
filter_by_year_input$sendKeysToElement(list("Select All")) # Type "Select All" to choose all years
Sys.sleep(1)
filter_by_year_input$sendKeysToElement(list(key = "enter")) # Press Enter to confirm selection
Sys.sleep(1)
filter_by_region_xpath <- '//*[@id="rc_select_3"]'
filter_by_region_input <- remote_driver$findElement(using = "xpath", value = filter_by_region_xpath)
filter_by_region_input$clickElement() # Click to activate the dropdown
Sys.sleep(1)
filter_by_region_input$sendKeysToElement(list("Select All Countries")) # Type "Select All Countries" to choose all
Sys.sleep(1)
filter_by_region_input$sendKeysToElement(list(key = "enter")) # Press Enter to confirm selection
Sys.sleep(1)
# Simulate clicking the "search" button
search_button_xpath <- '//*[@id="root"]/div/div/div[1]/div[5]'
search_button <- remote_driver$findElement(using = "xpath", value = search_button_xpath)
search_button$clickElement()
Sys.sleep(3)
# Simulate clicking the "download" button
json_download_xpath <- '//*[@id="root"]/div/div/div[3]/div[1]/div/div[2]/button'
json_download_button <- remote_driver$findElement(using = "xpath", value = json_download_xpath)
json_download_button$clickElement()
Sys.sleep(5)
# Close the new remote driver
remote_driver$close()
driver$server$stop()
## [1] TRUE
# Start a new Selenium driver with a new port
new_driver <- rsDriver(
browser = "firefox",
port = 1482L, # Use a different port
verbose = FALSE,
extraCapabilities = list(
"moz:firefoxOptions" = list(
prefs = firefox_profile
)
)
)
new_remote_driver <- new_driver$client
# Navigate to the target URL
new_remote_driver$navigate(gii_url)
Sys.sleep(3)
# Simulate four click and select process
filter_by_index_xpath <- '//*[@id="rc_select_0"]'
filter_by_index_input <- new_remote_driver$findElement(using = "xpath", value = filter_by_index_xpath)
filter_by_index_input$clickElement() # Click to activate the dropdown
Sys.sleep(1)
filter_by_index_input$sendKeysToElement(list("GII")) # Type "GII"
Sys.sleep(1)
filter_by_index_input$sendKeysToElement(list(key = "enter")) # Press Enter to confirm selection
Sys.sleep(1)
filter_by_indicator_xpath <- '//*[@id="rc_select_1"]'
filter_by_indicator_input <- new_remote_driver$findElement(using = "xpath", value = filter_by_indicator_xpath)
filter_by_indicator_input$clickElement() # Click to activate the dropdown
Sys.sleep(1)
filter_by_indicator_input$sendKeysToElement(list("Gender")) # Type "Gender" to choose Gender Inequality Index
Sys.sleep(1)
filter_by_indicator_input$sendKeysToElement(list(key = "enter")) # Press Enter to confirm selection
Sys.sleep(1)
filter_by_year_xpath <- '//*[@id="rc_select_2"]'
filter_by_year_input <- new_remote_driver$findElement(using = "xpath", value = filter_by_year_xpath)
filter_by_year_input$clickElement() # Click to activate the dropdown
Sys.sleep(1)
filter_by_year_input$sendKeysToElement(list("Select All")) # Type "Select All" to choose all years
Sys.sleep(1)
filter_by_year_input$sendKeysToElement(list(key = "enter")) # Press Enter to confirm selection
Sys.sleep(1)
filter_by_region_xpath <- '//*[@id="rc_select_3"]'
filter_by_region_input <- new_remote_driver$findElement(using = "xpath", value = filter_by_region_xpath)
filter_by_region_input$clickElement() # Click to activate the dropdown
Sys.sleep(1)
filter_by_region_input$sendKeysToElement(list("World"))
Sys.sleep(1)
for (i in 1:5) { # Adjust the number of iterations based on the position of the desired "World" option
filter_by_region_input$sendKeysToElement(list(key = "down_arrow"))
Sys.sleep(0.5)
}
filter_by_region_input$sendKeysToElement(list(key = "enter")) # Confirm selection with Enter
Sys.sleep(1)
# Simulate clicking the "search" button
search_button_xpath <- '//*[@id="root"]/div/div/div[1]/div[5]'
search_button <- new_remote_driver$findElement(using = "xpath", value = search_button_xpath)
search_button$clickElement()
Sys.sleep(3)
# Simulate clicking the "download" button
json_download_xpath <- '//*[@id="root"]/div/div/div[3]/div[1]/div/div[2]/button'
json_download_button <- new_remote_driver$findElement(using = "xpath", value = json_download_xpath)
json_download_button$clickElement()
Sys.sleep(10)
# Close the new remote driver
new_remote_driver$close()
new_driver$server$stop()
## [1] TRUE
# Download the Explanatory GII Image and plot it
img_url <- "https://hdr.undp.org/sites/default/files/styles/1400x/public/images/2022-05/GII_diagram.png?itok=fyvRbzzb"
downloaded_img_path <- file.path(custom_download_dir, "GII_diagram.png")
GET(img_url, write_disk(downloaded_img_path, overwrite = TRUE))
## Response [https://hdr.undp.org/sites/default/files/styles/1400x/public/images/2022-05/GII_diagram.png?itok=fyvRbzzb]
## Date: 2025-01-15 16:59
## Status: 200
## Content-Type: image/png
## Size: 61.5 kB
## <ON DISK> /Users/dongyao/Desktop/my472-at24-final-dongyaook/rowdata/GII_diagram.png
# Read the PNG
img_to_plot <- png::readPNG(downloaded_img_path)
# Display the Image Using ggplot2
ggplot() +
annotation_custom(rasterGrob(img_to_plot), xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) +
theme_void() +
ggtitle("Explanatory GII Diagram")
> This PNG explains the
GII Dimensions and Indicators
I structured the data to adhere to the principles of tidy data. I created unique IDs, standardized column names, integrated datasets, and transformed data into a long format.
output_ddir <- "output"
if (!dir.exists(output_ddir)) dir.create(output_ddir)
tidydata_ddir <- "output/tidydata"
if (!dir.exists(tidydata_ddir)) dir.create(tidydata_ddir)
#### ORIGINAL DATA 1:olympic_medal_data
olympic_medal_data <- read_csv("rowdata/olympic_medal_data.csv")
# It seems olympic_medal_data is already tidy.But we can add a column of IDs consisting of years and NOCs to make the table look neater
# Fix the formatting errors in 1932 and 1960
data_to_modify <- olympic_medal_data %>%
filter(olympic_year %in% c(1932, 1960)) %>%
mutate(country_participate = trimws(country_participate, which = "right"))
olympic_medal_data <- olympic_medal_data %>%
filter(!(olympic_year %in% c(1932, 1960))) %>%
bind_rows(data_to_modify)
# Make orrections
noc_data <- read_csv("rowdata/120-years-of-olympic-history-athletes-and-results/noc_regions.csv") %>%
# Filter out non-country NOC codes
filter(!NOC %in% c("ANZ", "NFL", "BOH", "TCH", "SAA", "GDR", "FRG", "CRT", "MAL","NBO", "ROT", "TUV", "UNK", "IOA", "NOC", "EUN", "URS", "YUG","SCG", "WIF", "VNM", "YAR", "YMD", "RHO", "HKG")) %>%
# Update alterntive notes
mutate(notes = case_when(
NOC == "GBR" & is.na(notes) ~ "Great Britain",
NOC == "USA" & is.na(notes) ~ "United States",
NOC == "RUS" & is.na(notes) ~ "Russian Empire, Soviet Union",
NOC == "SRI" & is.na(notes) ~ "Ceylon",
NOC == "GER" & is.na(notes) ~ "United Team of Germany, East Germany, West Germany",
NOC == "AUS" & is.na(notes) ~ "Australasia",
NOC == "CZE" & is.na(notes) ~ "Czechoslovakia, Bohemia",
NOC == "SRB" & is.na(notes) ~ "Yugoslavia",
NOC == "TEP" & is.na(notes) ~ "Chinese Taipei",
TRUE ~ notes
)) %>%
separate_rows(notes, sep = ", ") %>%
mutate(notes = trimws(notes))
# Prepare versions of noc_data for flexible joining
combined_noc_data <- noc_data %>%
select(NOC, country_participate = region) %>%
bind_rows(noc_data %>% select(NOC, country_participate = notes))
# Join on combined data and generate ID column
olympic_medal_data <- olympic_medal_data %>%
left_join(combined_noc_data, by = "country_participate") %>%
mutate(
id = ifelse(is.na(NOC), paste(olympic_year, "Unknown", sep = "_"), paste(olympic_year, NOC, sep = "_"))
) %>%
select(id, everything())
# Rename 'country_participate' to 'region_participate' as there may be unsolved geographical disputes
tidy_olympic_medal_data <- olympic_medal_data %>%
rename(region_participate = country_participate)
write_csv(tidy_olympic_medal_data, file.path(tidydata_ddir, "tidy_olympic_medal_data.csv"))
# ORIGINAL DATA 2:Olympic_Female_Medals_2020
olympic_female_medals2020 <- read_csv("rowdata/Olympic_Female_Medals_2020.csv")
# It seems Olympic_Female_Medals_2020 is already tidy.But can do some corrections about formatting
tidy_olympic_female_medals2020 <- olympic_female_medals2020 %>%
rename(Year = Year_Position) %>%
mutate(Year = ifelse(Year == "Position 61", 2020, as.numeric(Year)))
write_csv(tidy_olympic_female_medals2020, file.path(tidydata_ddir, "tidy_olympic_female_medals2020.csv"))
# ORIGINAL DATA 3:athlete_events
# Load the data
athlete_events <- read_csv("rowdata/120-years-of-olympic-history-athletes-and-results/athlete_events.csv")
# It seems Olympic_Female_Medals_2020 is untidy.
# Prepare the data by replacing NA with "None" and fiilter Year
athlete_events <- mutate(athlete_events, Year = as.integer(Year))
summer_athletes <- athlete_events %>%
filter(Year %in% olympic_years) %>%
mutate(Medal = ifelse(is.na(Medal), "None", Medal)) %>%
select(ID, Name, Sex, NOC, Year, Medal)
# Make the table wider by the function summarise
tidy_athlete_events <- summer_athletes %>%
mutate(
Gold = as.integer(Medal == "Gold"),
Silver = as.integer(Medal == "Silver"),
Bronze = as.integer(Medal == "Bronze")
) %>%
group_by(ID, Name, Sex, NOC, Year) %>%
summarise(
Gold = sum(Gold, na.rm = TRUE),
Silver = sum(Silver, na.rm = TRUE),
Bronze = sum(Bronze, na.rm = TRUE),
.groups = "drop"
)
write_csv(tidy_athlete_events, file.path(tidydata_ddir, "tidy_athlete_events.csv"))
# ORIGINAL DATA 4:olympic_participators2020
olympic_participators2020 <- read_csv("rowdata/olympic_participators2020.csv")
# It seems Olympic_Female_Medals_2020 is untidy.
# Extract headers and clean the data
headers <- paste0(olympic_participators2020[1, ], "_", olympic_participators2020[2, ])
headers <- gsub("_NA", "", headers)
headers[1] <- "Country"
olympic_participators2020 <- olympic_participators2020[-c(1, 2), ]
colnames(olympic_participators2020) <- headers
# Pivot it to longer
tidy_olympic_participators2020 <- olympic_participators2020 %>%
pivot_longer(
cols = -Country,
names_to = c("Event", "Gender"),
names_sep = "_",
values_to = "Participants"
) %>%
mutate(
Participants = as.numeric(Participants),
Gender = recode(Gender, m = "Male", w = "Female")
) %>%
drop_na(Participants)
# Generate an ID column
tidy_olympic_participators2020 <- tidy_olympic_participators2020 %>%
mutate(id = paste0("2020_", Country)) %>%
select(id, everything())
write_csv(tidy_olympic_participators2020, file.path(tidydata_ddir, "tidy_olympic_participators2020.csv"))
# ORIGINAL DATA 5:GII from JSON
GII_json <- fromJSON("rowdata/hdr-data.json")
gii_data <- as_tibble(GII_json)
GII_world_json <- fromJSON("rowdata/hdr-data(1).json")
gii_world_data <- as_tibble(GII_world_json)
# It seems they are already tidy.But for future convenience, we will also add ID here following the previous steps
tidy_gii_data <- gii_data %>%
left_join(combined_noc_data, by = c("country" = "country_participate")) %>%
mutate(
id = ifelse(is.na(NOC), paste(year, "Unknown", sep = "_"), paste(year, NOC, sep = "_"))
) %>%
select(id, country, year, value)
write_csv(tidy_gii_data, file.path(tidydata_ddir, "tidy_gii_data.csv"))
tidy_gii_world_data <- gii_world_data %>%
select( year, value)
write_csv(tidy_gii_world_data, file.path(tidydata_ddir, "tidy_gii_world_data.csv"))
# Now we can see tidy versions of all the data
print(head(tidy_olympic_medal_data))
## # A tibble: 6 × 10
## id olympic_year olympic_rank region_participate medal_gold medal_silver
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1896_USA 1896 1 United States 11 7
## 2 1896_GRE 1896 2 Greece 10 18
## 3 1896_GER 1896 3 Germany 6 5
## 4 1896_GER 1896 3 Germany 6 5
## 5 1896_GER 1896 3 Germany 6 5
## 6 1896_FRA 1896 4 France 5 4
## # ℹ 4 more variables: medal_bronze <dbl>, medal_total <dbl>,
## # country_host <dbl>, NOC <chr>
print(head(tidy_olympic_female_medals2020))
## # A tibble: 6 × 7
## Athlete Nat Gold Silver Bronze Total Year
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Emma McKeon AUS 4 0 2 6 2020
## 2 Lisa Carrington NZL 3 0 0 3 2020
## 3 Elaine Thompson-Herah JAM 3 0 0 3 2020
## 4 Kaylee McKeown AUS 3 0 0 3 2020
## 5 Katie Ledecky USA 2 2 0 4 2020
## 6 Ariarne Titmus AUS 2 1 1 4 2020
print(head(tidy_athlete_events))
## # A tibble: 6 × 8
## ID Name Sex NOC Year Gold Silver Bronze
## <dbl> <chr> <chr> <chr> <int> <int> <int> <int>
## 1 1 A Dijiang M CHN 1992 0 0 0
## 2 2 A Lamusi M CHN 2012 0 0 0
## 3 3 Gunnar Nielsen Aaby M DEN 1920 0 0 0
## 4 4 Edgar Lindenau Aabye M DEN 1900 1 0 0
## 5 5 Christine Jacoba Aaftink F NED 1988 0 0 0
## 6 5 Christine Jacoba Aaftink F NED 1992 0 0 0
print(head(tidy_olympic_participators2020))
## # A tibble: 6 × 5
## id Country Event Gender Participants
## <chr> <chr> <chr> <chr> <dbl>
## 1 2020_AFG AFG ATH Male 1
## 2 2020_AFG AFG ATH Female 1
## 3 2020_AFG AFG SHO Male 1
## 4 2020_AFG AFG SWM Male 1
## 5 2020_AFG AFG TKW Male 1
## 6 2020_AFG AFG Total Male 4
print(head(tidy_gii_data))
## # A tibble: 6 × 4
## id country year value
## <chr> <chr> <chr> <dbl>
## 1 2008_AFG Afghanistan 2008 0.69
## 2 2009_AFG Afghanistan 2009 0.696
## 3 2010_AFG Afghanistan 2010 0.707
## 4 2011_AFG Afghanistan 2011 0.718
## 5 2012_AFG Afghanistan 2012 0.734
## 6 2013_AFG Afghanistan 2013 0.712
print(head(tidy_gii_world_data))
## # A tibble: 1 × 2
## year value
## <chr> <dbl>
## 1 2022 0.485
The above tables show row data have become tidy tabular data
I make the data ready to use by following steps: 1.Summarized total participants by gender, region and year 2.Summarized medal result by gender, region and year 3.Add a uniquely identifiable ID column in each table using year and region 4.Calculate host count and rank 5.Calculate historical Olympic medal tally by country
readydata_ddir <- "output/readydata"
if (!dir.exists(readydata_ddir)) dir.create(readydata_ddir)
# FIRST TRANSFORMATIONS: Summarize the data (1896-2016) by country and year, while separately summarizing medals won by men and women.And add ID for future convenience.
country_year_summary <- tidy_athlete_events %>%
group_by(NOC, Year) %>%
summarise(
male_participants = sum(Sex == "M", na.rm = TRUE),
female_participants = sum(Sex == "F", na.rm = TRUE),
total_participants = male_participants + female_participants,
male_gold_medals = sum(Gold == 1 & Sex == "M", na.rm = TRUE),
male_silver_medals = sum(Silver == 1 & Sex == "M", na.rm = TRUE),
male_bronze_medals = sum(Bronze == 1 & Sex == "M", na.rm = TRUE),
female_gold_medals = sum(Gold == 1 & Sex == "F", na.rm = TRUE),
female_silver_medals = sum(Silver == 1 & Sex == "F", na.rm = TRUE),
female_bronze_medals = sum(Bronze == 1 & Sex == "F", na.rm = TRUE),
male_total_medals = male_gold_medals + male_silver_medals + male_bronze_medals,
female_total_medals = female_gold_medals + female_silver_medals + female_bronze_medals,
total_medals = male_total_medals + female_total_medals,
.groups = "drop"
)
deduplicated_combined_noc_data <- combined_noc_data %>%
distinct(NOC, .keep_all = TRUE)
ready_country_year_summary <- country_year_summary %>%
left_join(deduplicated_combined_noc_data, by = "NOC") %>% # Join with deduplicated NOC data
mutate(
id = paste(Year, NOC, sep = "_") # Create ID
) %>%
rename(Region = NOC) %>% # Rename NOC to Region
select(id, Region, everything(), -country_participate)
write_csv(ready_country_year_summary, file.path(readydata_ddir, "ready_country_year_summary.csv"))
print(head(ready_country_year_summary))
## # A tibble: 6 × 15
## id Region Year male_participants female_participants total_participants
## <chr> <chr> <int> <int> <int> <int>
## 1 1936_AFG AFG 1936 15 0 15
## 2 1948_AFG AFG 1948 25 0 25
## 3 1956_AFG AFG 1956 12 0 12
## 4 1960_AFG AFG 1960 12 0 12
## 5 1964_AFG AFG 1964 8 0 8
## 6 1968_AFG AFG 1968 5 0 5
## # ℹ 9 more variables: male_gold_medals <int>, male_silver_medals <int>,
## # male_bronze_medals <int>, female_gold_medals <int>,
## # female_silver_medals <int>, female_bronze_medals <int>,
## # male_total_medals <int>, female_total_medals <int>, total_medals <int>
# SECOND TRANSFORMATIONS: Turn the tidy_olympic_participators2020 into a long format and count the total number of people so that the data is the same as the ready_country_year_summary format
tidy_olympic_participators2020 <- read_csv("output/tidydata/tidy_olympic_participators2020.csv")
ready_olympic_participators2020 <- tidy_olympic_participators2020 %>%
pivot_wider(
names_from = Gender,
values_from = Participants
) %>%
rename(
Region = Country,
male_participants = Male,
female_participants = Female
) %>%
mutate(
male_participants = as.numeric(sapply(male_participants, function(x) ifelse(is.null(x), 0, x))),
female_participants = as.numeric(sapply(female_participants, function(x) ifelse(is.null(x), 0, x))) # Flatten and convert to numeric
) %>%
group_by(id, Region) %>%
summarise(
male_participants = sum(male_participants, na.rm = TRUE),
female_participants = sum(female_participants, na.rm = TRUE),
total_participants = male_participants + female_participants,
.groups = "drop"
) %>%
mutate(Year = 2020) # Add a fixed Year column as 2020
ready_olympic_participators2020 <- ready_olympic_participators2020 %>%
select(id, Region, Year, male_participants, female_participants, total_participants)
write_csv(ready_olympic_participators2020, file.path(readydata_ddir, "ready_olympic_participators2020.csv"))
print(head(ready_olympic_participators2020))
## # A tibble: 6 × 6
## id Region Year male_participants female_participants total_participants
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2020_AFG AFG 2020 8 2 10
## 2 2020_ALB ALB 2020 12 6 18
## 3 2020_ALG ALG 2020 50 26 76
## 4 2020_AND AND 2020 2 2 4
## 5 2020_ANG ANG 2020 8 32 40
## 6 2020_ANT ANT 2020 6 6 12
# THIRD TRANSFORMATIONS: Rename columns add IDs to tidy_olympic_female_medals2020, tidy_olympic_medal_data,making their format looks like ready_country_year_summary
tidy_olympic_female_medals2020 <- read_csv("output/tidydata/tidy_olympic_female_medals2020.csv")
ready_olympic_female_medals2020 <- tidy_olympic_female_medals2020 %>%
rename(
Region = Nat
) %>%
select(-Athlete) %>%
group_by(Region, Year) %>%
summarise(
female_gold_medals = sum(Gold, na.rm = TRUE),
female_silver_medals = sum(Silver, na.rm = TRUE),
female_bronze_medals = sum(Bronze, na.rm = TRUE),
female_total_medals = female_gold_medals + female_silver_medals + female_bronze_medals, # Calculate total medals
.groups = "drop"
) %>%
mutate(
id = paste(Year, Region, sep = "_")
) %>%
select(id, Region, Year, female_gold_medals, female_silver_medals, female_bronze_medals, female_total_medals) # Reorder columns
tidy_olympic_medal_data <- read_csv("output/tidydata/tidy_olympic_medal_data.csv")
ready_olympic_medal_data <- tidy_olympic_medal_data %>%
rename(
Year = olympic_year,
Region = NOC,
rank = olympic_rank,
whole_name = region_participate,
total_gold_medals = medal_gold,
total_silver_medals = medal_silver,
total_bronze_medals = medal_bronze,
total_medals = medal_total
) %>%
select(
Year, Region, rank, whole_name, total_gold_medals, total_silver_medals, total_bronze_medals, total_medals,country_host
) # Select and reorder the required columns
write_csv(ready_olympic_female_medals2020, file.path(readydata_ddir, "ready_olympic_female_medals2020.csv"))
write_csv(ready_olympic_medal_data, file.path(readydata_ddir, "ready_olympic_medal_data.csv"))
print(head(ready_olympic_female_medals2020))
## # A tibble: 6 × 7
## id Region Year female_gold_medals female_silver_medals
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020_AUS AUS 2020 13 1
## 2 2020_BRA BRA 2020 1 1
## 3 2020_CAN CAN 2020 1 1
## 4 2020_CHN CHN 2020 9 4
## 5 2020_GBR GBR 2020 2 2
## 6 2020_HUN HUN 2020 1 1
## # ℹ 2 more variables: female_bronze_medals <dbl>, female_total_medals <dbl>
print(head(ready_olympic_medal_data))
## # A tibble: 6 × 9
## Year Region rank whole_name total_gold_medals total_silver_medals
## <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1896 USA 1 United States 11 7
## 2 1896 GRE 2 Greece 10 18
## 3 1896 GER 3 Germany 6 5
## 4 1896 GER 3 Germany 6 5
## 5 1896 GER 3 Germany 6 5
## 6 1896 FRA 4 France 5 4
## # ℹ 3 more variables: total_bronze_medals <dbl>, total_medals <dbl>,
## # country_host <dbl>
# FOURTH TRANSFORMATIONS: Calculate host count and rank
host_count_rank <- ready_olympic_medal_data %>%
group_by(Region) %>%
summarise(
host_count = sum(country_host == 1, na.rm = TRUE), # Count hosting years
.groups = "drop"
) %>%
arrange(desc(host_count)) %>% # Rank by host count, descending
mutate(rank = row_number()) # Add ranking column
host_count_rank <- host_count_rank %>%
select(rank, Region, host_count, everything())
print(head(host_count_rank, 10))
## # A tibble: 10 × 3
## rank Region host_count
## <int> <chr> <int>
## 1 1 FRA 4
## 2 2 GER 4
## 3 3 USA 3
## 4 4 GRE 2
## 5 5 JPN 2
## 6 6 AUS 1
## 7 7 AUT 1
## 8 8 BEL 1
## 9 9 BRA 1
## 10 10 CAN 1
# FIFTH TRANSFORMATIONS: Calculate historical Olympic medal tally by country
historical_medal_tally <- ready_olympic_medal_data %>%
group_by(Region) %>%
summarise(
total_gold_medals = sum(total_gold_medals, na.rm = TRUE), # Sum of gold medals
total_silver_medals = sum(total_silver_medals, na.rm = TRUE), # Sum of silver medals
total_bronze_medals = sum(total_bronze_medals, na.rm = TRUE), # Sum of bronze medals
total_medals = sum(total_medals, na.rm = TRUE), # Total medals (gold + silver + bronze)
.groups = "drop"
) %>%
arrange(desc(total_medals)) %>% # Rank by total medals
mutate(rank = row_number()) # Add ranking column
historical_medal_tally <- historical_medal_tally %>%
select(rank, Region, total_medals, everything())
print(head(historical_medal_tally, 10))
## # A tibble: 10 × 6
## rank Region total_medals total_gold_medals total_silver_medals
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 GER 2837 891 925
## 2 2 USA 2768 1107 879
## 3 3 RUS 1862 690 573
## 4 4 GBR 991 305 342
## 5 5 FRA 829 245 280
## 6 6 CHN 727 303 226
## 7 7 ITA 667 231 206
## 8 8 AUS 610 183 196
## 9 9 JPN 544 189 162
## 10 10 HUN 532 187 162
## # ℹ 1 more variable: total_bronze_medals <dbl>
I used the following queries to lay a foundation for visualization: 1.I calculated and compared the participation rates of male and female athletes across different Olympic years and regions(includes “world”), revealing trends in gender equality in sports. 2.I deine a “medal efficiency” metric, demonstrating each region(includes “world”) and each gender’s effectiveness in converting participation into medals.
# As we've linked all the tables with ids, now we can use SQL to build a detailed database of Olympic participation and awards! It contains the region, year, gender, number of medals won separately.
# Let us start with Gender participation rate.
library(DBI)
library(RSQLite)
# Define directory for database
database_ddir <- "output/database"
if (!dir.exists(database_ddir)) dir.create(database_ddir)
db <- dbConnect(RSQLite::SQLite(), paste0(database_ddir, "olympic-db.sqlite"))
dbWriteTable(db, "country_year_summary", ready_country_year_summary, overwrite = TRUE)
dbWriteTable(db, "olympic_female_medals2020", ready_olympic_female_medals2020, overwrite = TRUE)
dbWriteTable(db, "olympic_medal_data", ready_olympic_medal_data, overwrite = TRUE)
dbWriteTable(db, "olympic_participators2020", ready_olympic_participators2020, overwrite = TRUE)
# SQL Query to Combine 2020 Data Explicitly with Historical Data
combine_query <- "
SELECT
cys.id AS id,
cys.Region AS Region,
cys.Year AS Year,
COALESCE(cys.male_participants, part.male_participants) AS male_participants,
COALESCE(cys.female_participants, part.female_participants) AS female_participants,
COALESCE(cys.total_participants, part.total_participants) AS total_participants
FROM country_year_summary cys
FULL OUTER JOIN olympic_participators2020 part ON cys.id = part.id
WHERE cys.Year IS NOT NULL OR part.Year = 2020
"
# Execute the query and save new table
combined_data <- dbGetQuery(db, combine_query)
dbWriteTable(db, "combined_data_with_2020", combined_data, overwrite = TRUE)
# Gender participation rates by country and year
id_gender_participation_rate_query <- "
SELECT
id, Region, Year,
COALESCE(male_participants, 0) AS male_participants,
COALESCE(female_participants, 0) AS female_participants,
COALESCE(total_participants, 0) AS total_participants,
ROUND(COALESCE(male_participants, 0) * 1.0 / COALESCE(total_participants, 1), 3) AS male_participation_rate,
ROUND(COALESCE(female_participants, 0) * 1.0 / COALESCE(total_participants, 1), 3) AS female_participation_rate
FROM (
SELECT
id, Region, Year, male_participants, female_participants,
total_participants
FROM country_year_summary
UNION ALL
SELECT
id, Region, Year, male_participants, female_participants,
total_participants
FROM olympic_participators2020
) combined
"
id_gender_participation_rate <- dbGetQuery(db, id_gender_participation_rate_query)
write_csv(id_gender_participation_rate, file.path(database_ddir, "id_gender_participation_rate.csv"))
# Global gender participation rates and year
historical_gender_participation_rate_query <- "
SELECT
Year,
SUM(COALESCE(male_participants, 0)) AS total_male_participants,
SUM(COALESCE(female_participants, 0)) AS total_female_participants,
SUM(COALESCE(total_participants, 0)) AS total_participants,
ROUND(SUM(COALESCE(male_participants, 0)) * 1.0 / SUM(COALESCE(total_participants, 0)), 3) AS global_male_participation_rate,
ROUND(SUM(COALESCE(female_participants, 0)) * 1.0 / SUM(COALESCE(total_participants, 0)), 3) AS global_female_participation_rate
FROM (
SELECT
Year, male_participants, female_participants, total_participants
FROM country_year_summary
UNION ALL
SELECT
Year, male_participants, female_participants, total_participants
FROM olympic_participators2020
) combined
GROUP BY Year
ORDER BY Year
"
historical_gender_participation_rate <- dbGetQuery(db, historical_gender_participation_rate_query)
write_csv(historical_gender_participation_rate, file.path(database_ddir, "historical_gender_participation_rate.csv"))
# Close the database connection
dbDisconnect(db)
# Then, We introduce "Medal Efficiency" to measure how effectively a country converts its participants into medals.
# we define: Gold = 4 point, Silver = 2 point, Bronze = 1 point, None medal = 0 point;
# Medal Efficiency = sum(points)/sum(Participants)
database_ddir <- "output/database"
if (!dir.exists(database_ddir)) dir.create(database_ddir)
db <- dbConnect(RSQLite::SQLite(), paste0(database_ddir, "olympic-db.sqlite"))
dbWriteTable(db, "country_year_summary", ready_country_year_summary, overwrite = TRUE)
dbWriteTable(db, "olympic_female_medals2020", ready_olympic_female_medals2020, overwrite = TRUE)
dbWriteTable(db, "olympic_medal_data", ready_olympic_medal_data, overwrite = TRUE)
dbWriteTable(db, "olympic_participators2020", ready_olympic_participators2020, overwrite = TRUE)
# Prepare data of 2020
dbExecute(db, "DELETE FROM country_year_summary WHERE Year = 2020")
## [1] 0
update_2020_data <- "
INSERT OR REPLACE INTO country_year_summary (
id, Region, Year,
male_participants, female_participants,
male_gold_medals, male_silver_medals, male_bronze_medals,
female_gold_medals, female_silver_medals, female_bronze_medals
)
SELECT
'2020_' || op.Region AS id,
op.Region,
2020 AS Year,
op.male_participants, op.female_participants,
(omd.total_gold_medals - ofm.female_gold_medals) AS male_gold_medals,
(omd.total_silver_medals - ofm.female_silver_medals) AS male_silver_medals,
(omd.total_bronze_medals - ofm.female_bronze_medals) AS male_bronze_medals,
ofm.female_gold_medals,
ofm.female_silver_medals,
ofm.female_bronze_medals
FROM
olympic_participators2020 op
JOIN
olympic_female_medals2020 ofm ON op.Region = ofm.Region
JOIN
olympic_medal_data omd ON op.Region = omd.Region AND omd.Year = 2020
"
dbExecute(db, update_2020_data)
## [1] 14
# Calculate historical medal efficiency by ID
dbExecute(db, "DROP TABLE IF EXISTS id_medal_efficiency")
## [1] 0
id_medal_efficiency_query <- "
CREATE TABLE id_medal_efficiency AS
SELECT
id, Region, Year,
ROUND(COALESCE((4 * male_gold_medals + 2 * male_silver_medals + male_bronze_medals) * 1.0 / NULLIF(male_participants, 0), 0), 3) AS male_medal_efficiency,
ROUND(COALESCE((4 * female_gold_medals + 2 * female_silver_medals + female_bronze_medals) * 1.0 / NULLIF(female_participants, 0), 0), 3) AS female_medal_efficiency,
ROUND(COALESCE((4 * (male_gold_medals + female_gold_medals) + 2 * (male_silver_medals + female_silver_medals) + (male_bronze_medals + female_bronze_medals)) * 1.0 / NULLIF((male_participants + female_participants), 0), 0), 3) AS total_medal_efficiency
FROM country_year_summary;
"
dbExecute(db, id_medal_efficiency_query)
## [1] 0
# Calculate global annual medal efficiency by gender
dbExecute(db, "DROP TABLE IF EXISTS global_medal_efficiency")
## [1] 0
global_medal_efficiency_query <- "
CREATE TABLE global_medal_efficiency AS
SELECT
Year,
ROUND(COALESCE(SUM(4 * male_gold_medals + 2 * male_silver_medals + male_bronze_medals) * 1.0 / NULLIF(SUM(male_participants), 0), 0), 3) AS global_male_medal_efficiency,
ROUND(COALESCE(SUM(4 * female_gold_medals + 2 * female_silver_medals + female_bronze_medals) * 1.0 / NULLIF(SUM(female_participants), 0), 0), 3) AS global_female_medal_efficiency,
ROUND(COALESCE(SUM(4 * (male_gold_medals + female_gold_medals) + 2 * (male_silver_medals + female_silver_medals) + (male_bronze_medals + female_bronze_medals)) * 1.0 / NULLIF(SUM(male_participants + female_participants), 0), 0), 3) AS global_total_medal_efficiency
FROM country_year_summary
GROUP BY Year;
"
dbExecute(db, global_medal_efficiency_query)
## [1] 0
id_medal_efficiency <- dbGetQuery(db, "SELECT * FROM id_medal_efficiency")
global_medal_efficiency <- dbGetQuery(db, "SELECT * FROM global_medal_efficiency")
write_csv(id_medal_efficiency, file.path(database_ddir, "id_medal_efficiency.csv"))
write_csv(global_medal_efficiency, file.path(database_ddir, "global_medal_efficiency.csv"))
dbDisconnect(db)
library(plotly)
vis_ddir <- "output/vis"
if (!dir.exists(vis_ddir)) dir.create(vis_ddir)
tidy_gii_world_data <- read_csv("output/tidydata/tidy_gii_world_data.csv") %>%
mutate(Year = as.numeric(year))
historical_gender_participation_rate<- read_csv("output/database/historical_gender_participation_rate.csv")
historical_gender_participation_rate <- historical_gender_participation_rate %>%
mutate(total_participation = global_male_participation_rate + global_female_participation_rate,
male_prop = global_male_participation_rate / total_participation,
female_prop = global_female_participation_rate / total_participation,
male_y_start = 1 - male_prop, # Starting point for male
male_y_end = 1, # Ending point for male at 100%
female_y_start = 0, # Starting point for female at 0%
female_y_end = female_prop, # Ending point for female
xmin = Year - 0.9, # Adjusted for better visualization
xmax = Year + 0.9)
participation_plot <- ggplot(historical_gender_participation_rate, aes(x = Year)) +
geom_rect(aes(xmin = xmin, xmax = xmax, ymin = male_y_start, ymax = male_y_end, fill = "Male")) +
geom_rect(aes(xmin = xmin, xmax = xmax, ymin = female_y_start, ymax = female_y_end, fill = "Female")) +
geom_line(aes(y = female_prop, group = 1, color = "Female Participation"), size = 0.8 ) +
geom_point(aes(y = female_prop), color = "yellow", size = 1, shape = 21, fill = "yellow") +
geom_line(data = tidy_gii_world_data, aes(x = year, y = value, group = 1, color = "GII Index"), size = 0.8) +
scale_fill_manual(values = c("Male" = "#ADD8E6", "Female" = "#FFB6C1")) +
scale_color_manual(values = c("Female Participation" = "yellow", "GII Index" = "red")) +
labs(title = "Global Participation in Summer Olympics (1896-2020) with GII",
x = "Year",
y = "Proportion of Total Participation",
fill = "Gender",
color = "Index & Trends") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom", # Move legend to the bottom
legend.title = element_blank() # Optionally remove legend titles
)
# Convert to interactive plotly object
participation_plotly <- ggplotly(participation_plot, tooltip = c("x", "y", "text")) %>%
layout(
hovermode = "closest",
xaxis = list(showspikes = TRUE, spikedash = "solid", spikecolor = "grey", spikethickness = 0.5),
yaxis = list(showspikes = TRUE, spikedash = "solid", spikecolor = "grey", spikethickness = 0.5)
)
# Save the interactive plot to HTML
htmlwidgets::saveWidget(participation_plotly, file.path(vis_ddir, "participation_plotly.html"))
print(participation_plotly)
The first visualization showed a trend where lower Global Gender Inequality Index (GII) values correlate with higher female participation rates in the Olympics, suggesting that gender equality fosters greater female involvement in sports.
global_medal_efficiency <- read_csv("output/database/global_medal_efficiency.csv")
# Create a base line chart
efficiency_plot <- ggplot(global_medal_efficiency, aes(x = Year)) +
geom_line(aes(y = global_male_medal_efficiency, color = "Male Efficiency"), size = 1) +
geom_line(aes(y = global_female_medal_efficiency, color = "Female Efficiency"), size = 1) +
geom_line(aes(y = global_total_medal_efficiency, color = "Global Efficiency"), size = 0.5, linetype = "dashed") +
scale_color_manual(values = c("Male Efficiency" = "#ADD8E6",
"Female Efficiency" = "#FFB6C1",
"Global Efficiency" = "#636363")) +
labs(
title = "Medal Efficiency by Gender in Summer Olympics (1896-2020)",
x = "Year",
y = "Medal Efficiency",
color = "Category"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.title = element_blank()
)
# Convert to an interactive Plotly
efficiency_plotly <- ggplotly(efficiency_plot, tooltip = c("x", "y", "color")) %>%
layout(
hovermode = "closest",
xaxis = list(showspikes = TRUE),
yaxis = list(showspikes = TRUE)
)
htmlwidgets::saveWidget(efficiency_plotly,file.path(vis_ddir, "efficiency_plotly.html"))
efficiency_plotly
The second chart revealed that despite historically lower participation rates, women’s medal efficiency in the Olympics has generally outperformed that of men, indicating that female athletes achieve high success relative to their participation numbers.
id_medal_efficiency <- read.csv("output/database/id_medal_efficiency.csv")
# Now let's define a concept: Gender equality level
# I use the reciprocal of the gender inequality index to represent the gender equality level, but this is not a serious academic concept, just for convenience
# Filter data for the UK from 1990 to 2020
uk_medal_efficiency <- id_medal_efficiency %>%
filter(Region == "GBR", Year >= 1990, Year <= 2020) %>%
select(Year, female_medal_efficiency)
us_gii_data <- tidy_gii_data %>%
filter(country == "United Kingdom") %>%
mutate(year = as.numeric(year),
GII_reciprocal = 1 - value) %>%
filter(year >= 1990, year <= 2020) %>%
select(year, GII_reciprocal)
uk_merged_data <- left_join(uk_medal_efficiency, us_gii_data, by = c("Year" = "year"))
# Create a ggplot object
uk_plot <- ggplot(uk_merged_data, aes(x = GII_reciprocal, y = female_medal_efficiency)) +
geom_point(color = "#636363") +
geom_smooth(method = "lm", color = "#EF476F") +
labs(
title = "UK Case Study on Gender Equality and Performance (1990-2020)",
x = " Gender Equality Level (1-GII)",
y = "Female Medal Efficiency"
) +
theme_minimal()
# Convert to an interactive Plotly graph
uk_plotly <- ggplotly(uk_plot, tooltip = c("x", "y")) %>%
layout(
hovermode = "closest",
xaxis = list(showspikes = TRUE),
yaxis = list(showspikes = TRUE)
)
htmlwidgets::saveWidget(uk_plotly,file.path(vis_ddir, "uk_plotly.html"))
uk_plotly
The third visualization used the UK as a case study to demonstrate a positive correlation between gender equality and female athletic performance, supporting the idea that societal gender equality advances women’s success in sports.
Finally, you can see the storage structure of the entire project through the following code, where rowdata contains csv, JSON and PNG, and output has processed tables and plotly HTML.
current_dir <- getwd()
files_list <- list.files(current_dir, recursive = TRUE, full.names = FALSE)
files_df <- data.frame(FileName = files_list)
print(files_df)
## FileName
## 1 MY472-AT24-final-instructions.md
## 2 MY472-AT24-final-report_files/figure-html/3.3-1.png
## 3 MY472-AT24-final-report.Rmd
## 4 output/database/global_medal_efficiency.csv
## 5 output/database/historical_gender_participation_rate.csv
## 6 output/database/id_gender_participation_rate.csv
## 7 output/database/id_medal_efficiency.csv
## 8 output/databaseolympic-db.sqlite
## 9 output/readydata/ready_country_year_summary.csv
## 10 output/readydata/ready_olympic_female_medals2020.csv
## 11 output/readydata/ready_olympic_medal_data.csv
## 12 output/readydata/ready_olympic_participators2020.csv
## 13 output/tidydata/tidy_athlete_events.csv
## 14 output/tidydata/tidy_gii_data.csv
## 15 output/tidydata/tidy_gii_world_data.csv
## 16 output/tidydata/tidy_olympic_female_medals2020.csv
## 17 output/tidydata/tidy_olympic_medal_data.csv
## 18 output/tidydata/tidy_olympic_participators2020.csv
## 19 output/vis/efficiency_plotly_files/crosstalk-1.2.1/css/crosstalk.min.css
## 20 output/vis/efficiency_plotly_files/crosstalk-1.2.1/js/crosstalk.js
## 21 output/vis/efficiency_plotly_files/crosstalk-1.2.1/js/crosstalk.js.map
## 22 output/vis/efficiency_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js
## 23 output/vis/efficiency_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js.map
## 24 output/vis/efficiency_plotly_files/crosstalk-1.2.1/scss/crosstalk.scss
## 25 output/vis/efficiency_plotly_files/htmltools-fill-0.5.8.1/fill.css
## 26 output/vis/efficiency_plotly_files/htmlwidgets-1.6.4/htmlwidgets.js
## 27 output/vis/efficiency_plotly_files/jquery-3.5.1/jquery-AUTHORS.txt
## 28 output/vis/efficiency_plotly_files/jquery-3.5.1/jquery.js
## 29 output/vis/efficiency_plotly_files/jquery-3.5.1/jquery.min.js
## 30 output/vis/efficiency_plotly_files/jquery-3.5.1/jquery.min.map
## 31 output/vis/efficiency_plotly_files/plotly-binding-4.10.4/plotly.js
## 32 output/vis/efficiency_plotly_files/plotly-htmlwidgets-css-2.11.1/plotly-htmlwidgets.css
## 33 output/vis/efficiency_plotly_files/plotly-main-2.11.1/plotly-latest.min.js
## 34 output/vis/efficiency_plotly_files/typedarray-0.1/typedarray.min.js
## 35 output/vis/efficiency_plotly.html
## 36 output/vis/participation_plotly_files/crosstalk-1.2.1/css/crosstalk.min.css
## 37 output/vis/participation_plotly_files/crosstalk-1.2.1/js/crosstalk.js
## 38 output/vis/participation_plotly_files/crosstalk-1.2.1/js/crosstalk.js.map
## 39 output/vis/participation_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js
## 40 output/vis/participation_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js.map
## 41 output/vis/participation_plotly_files/crosstalk-1.2.1/scss/crosstalk.scss
## 42 output/vis/participation_plotly_files/htmltools-fill-0.5.8.1/fill.css
## 43 output/vis/participation_plotly_files/htmlwidgets-1.6.4/htmlwidgets.js
## 44 output/vis/participation_plotly_files/jquery-3.5.1/jquery-AUTHORS.txt
## 45 output/vis/participation_plotly_files/jquery-3.5.1/jquery.js
## 46 output/vis/participation_plotly_files/jquery-3.5.1/jquery.min.js
## 47 output/vis/participation_plotly_files/jquery-3.5.1/jquery.min.map
## 48 output/vis/participation_plotly_files/plotly-binding-4.10.4/plotly.js
## 49 output/vis/participation_plotly_files/plotly-htmlwidgets-css-2.11.1/plotly-htmlwidgets.css
## 50 output/vis/participation_plotly_files/plotly-main-2.11.1/plotly-latest.min.js
## 51 output/vis/participation_plotly_files/typedarray-0.1/typedarray.min.js
## 52 output/vis/participation_plotly.html
## 53 output/vis/uk_plotly_files/crosstalk-1.2.1/css/crosstalk.min.css
## 54 output/vis/uk_plotly_files/crosstalk-1.2.1/js/crosstalk.js
## 55 output/vis/uk_plotly_files/crosstalk-1.2.1/js/crosstalk.js.map
## 56 output/vis/uk_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js
## 57 output/vis/uk_plotly_files/crosstalk-1.2.1/js/crosstalk.min.js.map
## 58 output/vis/uk_plotly_files/crosstalk-1.2.1/scss/crosstalk.scss
## 59 output/vis/uk_plotly_files/htmltools-fill-0.5.8.1/fill.css
## 60 output/vis/uk_plotly_files/htmlwidgets-1.6.4/htmlwidgets.js
## 61 output/vis/uk_plotly_files/jquery-3.5.1/jquery-AUTHORS.txt
## 62 output/vis/uk_plotly_files/jquery-3.5.1/jquery.js
## 63 output/vis/uk_plotly_files/jquery-3.5.1/jquery.min.js
## 64 output/vis/uk_plotly_files/jquery-3.5.1/jquery.min.map
## 65 output/vis/uk_plotly_files/plotly-binding-4.10.4/plotly.js
## 66 output/vis/uk_plotly_files/plotly-htmlwidgets-css-2.11.1/plotly-htmlwidgets.css
## 67 output/vis/uk_plotly_files/plotly-main-2.11.1/plotly-latest.min.js
## 68 output/vis/uk_plotly_files/typedarray-0.1/typedarray.min.js
## 69 output/vis/uk_plotly.html
## 70 README.md
## 71 rowdata/120-years-of-olympic-history-athletes-and-results.zip
## 72 rowdata/120-years-of-olympic-history-athletes-and-results/athlete_events.csv
## 73 rowdata/120-years-of-olympic-history-athletes-and-results/noc_regions.csv
## 74 rowdata/GII_diagram.png
## 75 rowdata/hdr-data.json
## 76 rowdata/hdr-data(1).json
## 77 rowdata/Olympic_Female_Medals_2020.csv
## 78 rowdata/olympic_medal_data.csv
## 79 rowdata/olympic_participators2020.csv
library(tidyverse)
rmd_file <- "MY472-AT24-final-report.Rmd" # path to your Rmd file
read_file(rmd_file) %>% # read the file as a text file
str_squish() %>% # remove all extra white space
str_replace("^.+?output.+?[-]{3}", "") %>% # remove header
str_replace_all("``` *[{].+?```", " ") %>% # remove code chunks
str_replace_all("<![-].+?-->", " ") %>% # remove rmd comments
str_replace_all("[!]?\\[.+?\\][(].+?[)]", " ") %>% # remove links
str_replace_all("(^|\\s+)[^A-Za-z0-9]+", " ") %>% # remove symbols (1)
str_replace_all("[^A-Za-z0-9]+($|\\s+)", " ") %>% # remove symbols (2)
str_count("\\S+") %>%
paste("The document is", ., "words.") %>%
print()
## [1] "The document is 681 words."
>>>>>>> 103afe631bd05b0cf645a264ad34b0e027198e03